Similarity-Based Estimation of Word Cooccurrence Probabilities (arXiv:cmp-lg/9405001, 2 May 1994)

Authors

  • Ido Dagan
  • Fernando Pereira
  • Lillian Lee
Abstract

In many applications of natural language processing it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations “eat a peach” and “eat a beach” is more likely. Statistical NLP methods determine the likelihood of a word combination according to its frequency in a training corpus. However, the nature of language is such that many word combinations are infrequent and do not occur in a given corpus. In this work we propose a method for estimating the probability of such previously unseen word combinations using available information on “most similar” words. We describe a probabilistic word association model based on distributional word similarity, and apply it to improving probability estimates for unseen word bigrams in a variant of Katz’s back-off model. The similarity-based method yields a 20% perplexity improvement in the prediction of unseen bigrams and statistically significant reductions in speech-
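As a rough illustration of the idea in the abstract, the sketch below estimates the probability of an unseen bigram by averaging the conditional distributions of distributionally similar words. It is not the paper's exact model: the use of Jensen-Shannon divergence, the exponential weighting, the `beta` parameter, the function names, and the toy corpus are all assumptions chosen for this example.

```python
import math
from collections import Counter, defaultdict


def mle_conditionals(bigrams):
    """Maximum-likelihood P(w2 | w1) from a list of (w1, w2) pairs."""
    counts = defaultdict(Counter)
    for w1, w2 in bigrams:
        counts[w1][w2] += 1
    probs = {}
    for w1, ctr in counts.items():
        total = sum(ctr.values())
        probs[w1] = {w2: c / total for w2, c in ctr.items()}
    return probs


def js_divergence(p, q):
    """Jensen-Shannon divergence between two sparse distributions.

    Assumption for this sketch: JS is used because it stays finite
    even when the two distributions have disjoint support.
    """
    div = 0.0
    for k in set(p) | set(q):
        pk, qk = p.get(k, 0.0), q.get(k, 0.0)
        mk = 0.5 * (pk + qk)
        if pk > 0:
            div += 0.5 * pk * math.log(pk / mk)
        if qk > 0:
            div += 0.5 * qk * math.log(qk / mk)
    return div


def similarity_estimate(probs, w1, w2, beta=5.0):
    """Similarity-weighted average of P(w2 | v) over words v other than w1.

    Each neighbour v is weighted by exp(-beta * JS(w1, v)); `beta` is a
    hypothetical tuning parameter, not a value taken from the paper.
    """
    weights = {
        v: math.exp(-beta * js_divergence(probs[w1], probs[v]))
        for v in probs
        if v != w1
    }
    norm = sum(weights.values())
    if norm == 0.0:
        return 0.0
    return sum(w * probs[v].get(w2, 0.0) for v, w in weights.items()) / norm


# Toy corpus (hypothetical): "eat pear" never occurs, but "devour"
# shares a context with "eat" and does occur with "pear".
bigrams = [
    ("eat", "peach"), ("eat", "apple"),
    ("devour", "apple"), ("devour", "pear"),
    ("sit", "chair"), ("sit", "bench"),
]
probs = mle_conditionals(bigrams)
print(similarity_estimate(probs, "eat", "pear"))
print(similarity_estimate(probs, "eat", "chair"))
```

In this toy corpus the unseen bigram "eat pear" receives a higher estimate than "eat chair", because "devour" is distributionally closer to "eat" than "sit" is. In a back-off setting, an estimate of this kind would only be consulted for bigrams with zero count in the training data.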

Similar resources

Clustering Words with the MDL Principle (arXiv:cmp-lg/9605014v1, 12 May 1996)

We address the problem of automatically constructing a thesaurus by clustering words based on corpus data. We view this problem as that of estimating a joint distribution over the Cartesian product of a partition of a set of nouns and a partition of a set of verbs, and propose a learning algorithm based on the Minimum Description Length (MDL) Principle for such estimation. We empirically compar...

Spectral Duality for Planar Billiards

ao-dyn/9405001, 2 May 1994. J.-P. Eckmann (1, 2) and C.-A. Pillet (1). (1) Dépt. de Physique Théorique, Université de Genève, CH-1211 Genève 4, Switzerland. (2) Section de Mathématiques, Université de Genève, CH-1211 Genève 4, Switzerland. Abstract. For a bounded open domain with connected complement in R^2 and piecewise smooth boundary, we consider the Dirichlet Lapla...

Jordanian U_{h,s} gl(2) and its coloured realization (arXiv:q-alg/9705027v1, 28 May 1997)

A two-parametric non-standard (Jordanian) deformation of the Lie algebra gl(2) is constructed and then exploited to obtain a new, triangular R-matrix solution of the coloured Yang-Baxter equation. The corresponding coloured quantum group is presented explicitly.

Publication date: 1994